Abstract

We present evidence, based on play-by-play data from all 6087 games
from the 2006/07-2009/10 seasons of the National Basketball Association
(NBA), that basketball scoring is well described by a continuous-time
anti-persistent random walk. The time intervals between successive scoring
events follow an exponential distribution, with essentially no memory between
different scoring intervals. By including the heterogeneity of team strengths,
we build a detailed computational random-walk model that accounts for a variety
of statistical properties of scoring in basketball games, such as the distribution
of the score difference between game opponents, the fraction of game time
that one team is in the lead, the number of lead changes in each game, and the
season win/loss records of each team.
1 Introduction



Sports provide a rich laboratory in which to study competitive behavior in a well- 
defined way. The goals of sports competitions are simple, the rules are well defined, 
and the results are easily quantifiable. With the recent availability of high-quality 
data for a broad range of performance metrics in many sports (see, for example, 
shrpsports.com), it is now possible to address questions about measurable aspects
of sports competitions that were inaccessible only a few years ago. Accompanying 
this wealth of new data is a rapidly growing body of literature, both for scientific and 

In this spirit, our investigation is motivated by the following simple question:
can basketball scoring be described by a random walk? To answer this question
we analyze play-by-play data for four seasons of all National Basketball
Association (NBA) games. Our analysis indicates that a simple random-walk model
successfully captures many features of the observed scoring patterns. We focus on 
basketball primarily because there are many points scored per game — roughly 100 
scoring events in a 48-minute game — and also many games in a season. The large 
number of scoring events allows us to perform a meaningful statistical analysis. 

Our random walk picture addresses the question of whether sports performance
metrics are determined by memory-less stochastic processes or by processes
with long-time correlations (Gilovich, Vallone, and Tversky (1985), Miller and
Weinberg (1991), Gould (1996), Dyte and Clarke (2000), Everson and
Goldsmith-Pinkham (2008)). To the untrained eye, streaks or slumps (namely,
sustained periods of superior or inferior performance) seem so unusual that they
ought to have exceptional explanations. This impression is at odds with the data,
however. Impartial analysis of individual player data in basketball has discredited
the notion of a 'hot hand' (Gilovich et al. (1985), Ayton and Fischer (2004)).
Rather, a player's shooting percentage is independent of past performance, so that
apparent streaks or slumps are simply a consequence of a series of random
uncorrelated scoring events. Similarly, in baseball, teams do not get 'hot' or
'cold' (Vergin (2000), Sire and Redner (2009)); instead, the functional forms of
winning and losing streak distributions arise from random statistical fluctuations.

In this work, we focus on the statistical properties of scoring during each 
basketball game. The scoring data are consistent with the scoring rate being de- 
scribed by a continuous-time Poisson process. Consequently, apparent scoring 
bursts or scoring droughts arise from the Poisson statistics rather than from a
temporally correlated process. Our main hypothesis is that the evolution of the
score difference between two competing teams can be accounted for by a
continuous-time random walk.

This idealized picture of random scoring has to be augmented by two fea- 
tures — one that may be ubiquitous and one idiosyncratic to basketball. The former 
is the existence of a weak linear restoring force, in which the leading team scores 
at a slightly lower rate (conversely, the losing team scores at a slightly higher rate). 
This restoring force seems to be a natural human response to an unbalanced game — 
a team with a large lead may be tempted to coast, while a lagging team likely plays 
with greater urgency. A similar "rich get poorer" and "poor get richer" phenomenon
was found in economic competitions where each interaction has low decisiveness
(Durham, Hirschleifer, and Smith (1998), Garfinkel and Skaperdas (2007)). Such a
low payoff typifies basketball, where the result of any single play is unlikely to
determine the outcome of the game. The second feature, idiosyncratic to basketball,
is anti-persistence, in which a score by one team is more likely to be followed by a 
score from the opponent because of the change in ball possession after each score. 
By incorporating these attributes into a continuous-time random-walk description 
of scoring, we build a computational model for basketball games that reproduces 
many statistical features of basketball scoring and team win/loss records. 

2 Scoring Rate 

Basketball is played between two teams with five players each. Points are scored 
by making baskets that are each worth 2 points (typically) or 3 points. Additional 
single-point baskets can occur by foul shots that are awarded after a physical or 
technical foul. The number of successive foul shots is typically 1 or 2, but more can 
occur. The duration of a game is 48 minutes (2880 seconds). Games are divided 
into four 12-minute quarters, with stoppage of play at the end of each quarter. The 
flow of the game is ostensibly continuous, but play does stop for fouls, time-outs, 
and out-of-bounds calls. An important feature that sets the time scale of scoring is 
the 24-second clock. In the NBA, a team must either attempt a shot that hits the 
rim or score within 24 seconds of gaining possession of the ball, or else possession 
is forfeited to the opposing team. At the end of the game, the team with the most 
points wins. 

We analyze play-by-play data from 6087 NBA games for the 2006/07-2009/10
seasons, including playoff games (see www.basketballvalue.com); for
win/loss records we use a larger dataset for 20 NBA seasons (www.shrpsports.com).
To simplify our analysis, we consider scoring only until the end of regulation time.
Thus every game is exactly 48 minutes long and some games end in ties. We omit
overtime to avoid the complications of games of different durations and the
possibility that scoring patterns during overtime could be different from those
during regulation time.

We focus on what we term scoring plays, rather than individual baskets. A
scoring play includes any number of baskets that are made with no time elapsed
between them on the game clock. For example, a 2-point play could be a single field
goal or two consecutive successful foul shots; a 3-point play could be a normal field
goal that is immediately followed by a successful foul shot, or a single successful
shot from outside the 3-point line. High-value plays of 5 and 6 points involve
multiple technical or flagrant fouls. Since they have negligible probability of
occurrence (Table 1), we will ignore them in our analysis. Consistent with our focus
on scoring plays, we define the scoring rate as the number of scoring plays per
second. This quantity is measured for each second of the game. For the 4 seasons of
data, the average scoring rate is roughly constant over the course of a game, with
mean value of 0.03291 plays/sec (Fig. 1). Averaging each quarter separately gives a
scoring rate of 0.03314, 0.03313, 0.03243, and 0.03261 for first through fourth
quarters, respectively. The scoring rate corresponds to 94.78 successful plays per
game. Since there is, on average, 2.0894 points scored per play, each team has 99.018
points in an average game (Westfall (1990)). Parenthetically, the average scoring
rate is constant from season to season, and equals 0.03266, 0.03299, 0.03284, 0.03315
for the 2006-07 to the 2009-10 seasons.



Points per Basket   Percentage        Points per Play   Percentage
1 pt.               33.9%             1 pt.             8.70%
2 pts.              54.6%             2 pts.            73.86%
3 pts.              11.5%             3 pts.            17.28%
                                      4 pts.            0.14%
                                      5 pts.            0.023%
                                      6 pts.            0.0012%

Table 1: Point values of each basket (left) and each play (right) and their respective
percentages.
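
As a quick consistency check on the figures quoted for the scoring rate, the following minimal sketch recomputes the mean points per play from the play percentages in Table 1, and from it the plays per game and points per team. The function names are illustrative, and dividing by the sum of the percentages (which is not exactly 100 because of rounding) is an assumption of this sketch.

```python
# Play-value percentages from Table 1 (points per play -> percent of plays).
PLAY_PERCENT = {1: 8.70, 2: 73.86, 3: 17.28, 4: 0.14, 5: 0.023, 6: 0.0012}

RATE = 0.03291        # mean scoring rate quoted in the text, plays per second
GAME_SECONDS = 2880   # 48 minutes of regulation time


def mean_points_per_play(percent):
    """Weighted mean play value; weights are renormalized to sum to one."""
    total = sum(percent.values())
    return sum(points * p for points, p in percent.items()) / total


def plays_per_game(rate=RATE, duration=GAME_SECONDS):
    """Expected number of scoring plays in one regulation game."""
    return rate * duration


def points_per_team(rate=RATE, duration=GAME_SECONDS, percent=PLAY_PERCENT):
    """Expected points per team: total points split evenly between two teams."""
    return plays_per_game(rate, duration) * mean_points_per_play(percent) / 2


if __name__ == "__main__":
    print(mean_points_per_play(PLAY_PERCENT), plays_per_game(), points_per_team())
```

The small differences from the quoted 2.0894 and 99.018 come from rounding in the tabulated percentages.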



Curiously, significant deviations from the constant scoring rate occur near the
start and end of each quarter (Fig. 1(a)). During roughly the first 10 seconds of
each quarter, scoring is unlikely because of a natural minimum time to make a
basket after the initiation of play. Near the end of each of the first three quarters,
the scoring rate first decreases and then sharply increases right at the end of the
quarter. This anomaly arises because, within the last 24 seconds of the quarter,
teams may intentionally delay their final shot until the last moment, so that the
opponent has no chance for another shot before the quarter ends. However, there is
only an increase in the scoring rate before the end of the game, possibly because of
the urgent effort of a losing team in attempting to mount a last-minute comeback via
intentional fouls. While these deviations from a constant scoring rate are visually
prominent, they occur over a small time range near the end of each quarter. For
the rest of our analysis, we ignore these end-of-quarter anomalies and assume that
scoring in basketball is temporally homogeneous.

Figure 1: (a) Average scoring rate as a function of time over all games in our dataset.
(b) Rate near the change of each quarter; zero on the abscissa corresponds to the
start/end of a quarter.

In addition to temporal homogeneity, the data suggest that scoring frequency
obeys a Poisson-like process, with little memory between successive scores (see
also de Saa Guerra, Gonzalez, Montesdeoca, Ruiz, Arjonilla-Lopez, and García-Manso
(20...)). To illustrate this property, we study the probability $P(t)$ of time
intervals between successive scoring plays. There are two natural such time
intervals: (a) the interval $t_e$ between successive scores of either team, and (b)
the interval $t_s$ between successive scores of the same team. The probability
$P(t_e)$ has a peak at roughly 16 seconds, which evidently is determined by the
24-second shot clock. This probability distribution decays exponentially in time
over nearly the entire range of data (Fig. 2). Essentially the same behavior arises
for $P(t_s)$, except that the time scale is larger by an obvious factor of 2. When
all the same-team time intervals are divided by 2, the distributions $P(t_e)$ and
$P(t_s)$ overlap substantially. The long-time tails of both $P(t_e)$ and
$2P(t_s/2)$ are proportional to the exponential function
$\exp(-\lambda_{\rm tail} t)$, with rate $\lambda_{\rm tail} = 0.048$ plays/sec.
This value is larger than the actual scoring rate of 0.03291 plays/sec because
scoring intervals of less than 10 seconds are common for the exponential
distribution but are rare in real basketball games. Amusingly, the longest time
interval in the dataset for which neither team scored was 402 seconds, while the
longest interval for which a single team did not score was 685 seconds.
Figure 2: Probability distributions of time intervals between successive scores for
either team, $P(t_e)$ vs. $t_e$ (a), and for the same team, $P(t_s)$ vs. $t_s$ (b).
The line is the least-squares linear fit of $\ln(P)$ vs. $t$ over the range
$t_e > 30$ sec and $t_s > 60$ sec and corresponds to a decay rate
$\lambda_{\rm tail} = 0.048$ and 0.024, respectively.
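
The tail fit described in the caption can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the intervals are drawn from an ideal exponential distribution rather than from game data, and the bin width, upper cutoff, and function name are arbitrary choices.

```python
import math
import random


def fit_tail_rate(intervals, t_min, t_max, bin_width=5.0):
    """Least-squares linear fit of ln P(t) vs. t on [t_min, t_max).

    Returns the decay rate, i.e. minus the fitted slope.
    """
    n_bins = int((t_max - t_min) / bin_width)
    counts = [0] * n_bins
    for t in intervals:
        k = int((t - t_min) // bin_width)
        if 0 <= k < n_bins:
            counts[k] += 1
    xs, ys = [], []
    for k, c in enumerate(counts):
        if c > 0:  # empty bins have no logarithm and are skipped
            xs.append(t_min + (k + 0.5) * bin_width)
            ys.append(math.log(c / (len(intervals) * bin_width)))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic stand-in for the either-team intervals (not real game data).
rng = random.Random(0)
synthetic = [rng.expovariate(0.048) for _ in range(200_000)]
rate = fit_tail_rate(synthetic, t_min=30.0, t_max=120.0)

if __name__ == "__main__":
    print(rate)
```

For an ideal exponential the fitted tail rate matches the overall rate; in the game data the two differ because short intervals are suppressed.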



It is instructive to compare the distribution of total score in a single game to
that of a Poisson process. Under the assumption that scores occur at the
empirically-observed rate of $\lambda = 0.03291$ plays/sec, the probability that a
game has $k$ scoring plays is given by the Poisson distribution,
$\mathrm{Prob}(\#\,\text{plays} = k) = \frac{1}{k!}(\lambda T)^k e^{-\lambda T}$,
where $T = 2880$ sec is the game duration. Since the average score of each play
is $\bar{s} = 2.0894$ points, a game that contains $k$ scoring plays will have a
total score of approximately $S = \bar{s}k$. By changing variables from $k$ to $S$
in the above Poisson distribution, the probability that a game has a total score
$S$ is

$$\mathrm{Prob}(\text{score} = S) = \frac{1}{\bar{s}}\, \frac{(\lambda T)^{S/\bar{s}}}{(S/\bar{s})!}\, e^{-\lambda T}. \tag{1}$$

This probability agrees reasonably with game data (Fig. 3), considering that Eq. (1)
is derived using only the mean scoring rate and mean points per play. By including
the different point values for each play, the resulting score distribution would
broaden. Furthermore, if we impose a cutoff in the probability of short scoring
intervals (see Fig. 2), the total score distribution of Fig. 3 would shift slightly
left, which would bring the model prediction closer to the data.
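
The Poisson estimate of the total-score distribution can be evaluated numerically. The sketch below is one possible implementation, not the authors' code; it assumes the factorial of a non-integer argument is interpreted through the gamma function, and the function names are illustrative.

```python
import math

RATE = 0.03291   # scoring plays per second
T = 2880.0       # game duration in seconds
S_BAR = 2.0894   # mean points per scoring play


def prob_plays(k, rate=RATE, duration=T):
    """Poisson probability of k scoring plays; k may be non-integer."""
    mu = rate * duration
    # Work with logarithms so large k does not overflow.
    return math.exp(k * math.log(mu) - math.lgamma(k + 1.0) - mu)


def prob_score(score, rate=RATE, duration=T, s_bar=S_BAR):
    """Probability density of a total score after changing variables k -> S."""
    return prob_plays(score / s_bar, rate, duration) / s_bar


if __name__ == "__main__":
    most_likely = max(range(800), key=prob_score)
    print(most_likely, prob_score(most_likely))
```

The most likely total score in this approximation lies just below the mean of about 198 points for the two teams combined.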

An important aspect of the time intervals between successive scoring events
is that they are weakly correlated. To illustrate this feature, we take the
time-ordered list of successive scoring intervals $t_1, t_2, \ldots$ for all games
and compute the n-lag correlation function (Box and Jenkins)

$$C(n) = \frac{\sum_k (t_k - \langle t \rangle)(t_{k+n} - \langle t \rangle)}{\sum_k (t_k - \langle t \rangle)^2}. \tag{2}$$

Figure 3: Probability Prob(score = $S$) for a total score $S$ in a single game.
Circles are the data, and the solid curve is the Poisson distribution (1).

Thus $n = 1$ gives the correlation between the time intervals between successive
scores, $n = 2$ to second-neighbor score intervals, etc. For both the intervals
$t_e$ (independent of which team scored) and $t_s$ (single team), we find that
$C(n) < 0.03$ for $n \geq 1$. Thus there is little correlation between scoring
events, suggesting that basketball scoring is a nearly memory-less process.
Accordingly, scoring bursts or scoring droughts are nothing more than manifestations
of the fluctuations inherent in a Poisson process of random and temporally
homogeneous scoring events.
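
A lag correlation of this kind can be sketched as follows. This is a minimal illustration that assumes the standard normalized autocorrelation (covariance at lag n divided by the variance) and uses synthetic, independent exponential intervals in place of game data.

```python
import random


def lag_correlation(values, n):
    """Normalized n-lag autocorrelation of a sequence."""
    mean = sum(values) / len(values)
    dev = [v - mean for v in values]
    numerator = sum(dev[k] * dev[k + n] for k in range(len(dev) - n))
    denominator = sum(d * d for d in dev)
    return numerator / denominator


# Synthetic memory-less intervals (an assumption; not the NBA dataset).
rng = random.Random(1)
intervals = [rng.expovariate(0.03291) for _ in range(50_000)]

if __name__ == "__main__":
    for lag in range(1, 6):
        print(lag, lag_correlation(intervals, lag))
```

For truly independent intervals the correlation at every nonzero lag fluctuates around zero with a magnitude of order one over the square root of the sample size.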



3 Random-Walk Description of Scoring 

We now turn to the question of which team scores in each play to build a random- 
walk description of scoring dynamics. After a given team scores, possession of 
the ball reverts to the opponent. This change of possession confers a significant 
disadvantage for a team to score twice in succession. On average, immediately after 
a score, the same team scores again with probability q = 0.348, while the opponent 
scores with probability 0.652. This tendency for alternating scores is characteristic
of an anti-persistent random walk (García-Pelayo (2007)), in which a step in a given
direction is more likely to be followed by a step in the opposite direction.

As we now discuss, this anti-persistence is a determining factor in the
streak-length distribution. A streak of length $s$ occurs when a team scores a total
of $s$ consecutive points before the opposing team scores. We define $Q(s)$ as the
probability for a streak to have length $s$. To estimate this streak-length
probability, note that since $\bar{s} = 2.0894$ points are scored, on average, in a
single play, a scoring streak of $s$ points corresponds to $s/\bar{s}$ consecutive
scoring plays. In terms of an anti-persistent random walk, the probability $Q(s)$
for a scoring streak of $s$ points is $Q(s) = A q^{s/\bar{s}}$, where
$A = q^{-1/\bar{s}} - 1$ is the normalization constant. This simple form reproduces
the observed exponentially decaying probability of scoring streaks reasonably
accurately (Fig. 4).

Figure 4: Probability $Q(s)$ for a consecutive point streak of $s$ points (o). The
dashed line corresponds to $Q(s) = A q^{s/\bar{s}}$, with $q = 0.348$ and $A$ the
normalization constant. The solid line corresponds to a refined model that
incorporates the different probabilities of 1, 2, 3, and 4-point plays (see Eqs. (3)
and (5)).
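
The normalization of the simple exponential streak form follows from summing a geometric series over integer streak lengths starting at one point. A short numerical sketch (function name illustrative) confirms that the constant defined in the text makes the probabilities sum to one.

```python
Q_SAME = 0.348   # probability that the same team scores twice in a row
S_BAR = 2.0894   # mean points per scoring play


def streak_prob_simple(s, q=Q_SAME, s_bar=S_BAR):
    """Simple estimate Q(s) = A * q**(s / s_bar) for a streak of s points."""
    x = q ** (1.0 / s_bar)      # decay factor per point
    norm = 1.0 / x - 1.0        # A = q**(-1/s_bar) - 1
    return norm * x ** s


if __name__ == "__main__":
    print(sum(streak_prob_simple(s) for s in range(1, 500)))
```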



However, we can do better by constructing a refined model that incorporates
the different probabilities for 1, 2, 3, and 4 point plays. Let $w_\alpha$ be the
probability that a play is worth $\alpha$ points (Table 1) and let $v_m$ be the
value of the $m$th play in a streak. A scoring sequence $\{v_1, \ldots, v_n\}$ that
results in $s$ points must satisfy the constraint $\sum_{k=1}^{n} v_k = s$, where
$n$ is the number of plays in the sequence. The probability for this streak is given
by $\prod_{j=1}^{n} w_{v_j}$. Because a streak of length $s$ points involves a
variable number of plays, the total probability for a streak of $s$ points is

$$Q(s) = \sum_{n=1}^{s} q^{n-1}(1-q) \left( \sum_{\{v_k\}} \prod_{j=1}^{n} w_{v_j} \right). \tag{3}$$

Here the inner sum is over all allowed sequences $\{v_k\}$ of $n$ consecutive
point-scoring events, and the factor $q^{n-1}(1-q)$ gives the probability for a
streak of exactly $n$ plays. For example, the probabilities for streaks up to
$s = 4$ are:

$$\begin{aligned}
Q(1) &= (1-q)\, w_1, \\
Q(2) &= (1-q)\, [w_2 + q w_1^2], \\
Q(3) &= (1-q)\, [w_3 + 2 q w_1 w_2 + q^2 w_1^3], \\
Q(4) &= (1-q)\, [w_4 + q (2 w_1 w_3 + w_2^2) + 3 q^2 w_1^2 w_2 + q^3 w_1^4].
\end{aligned} \tag{4}$$

A direct calculation of these probabilities for general $s$ becomes tedious for
large $s$, but we can calculate them recursively for $s > 4$. To do so, we decompose
a streak of $s$ points as a streak of $s - v_n$ points, followed by a single play
worth $v_n$ points. The probability of such a play is $q w_{v_n}$. Because the last
play can be worth 1, 2, 3, or 4 points, the probability for a streak of length $s$
is given recursively by

$$Q(s) = q\,[w_1 Q(s-1) + w_2 Q(s-2) + w_3 Q(s-3) + w_4 Q(s-4)]. \tag{5}$$

Using Eqs. (4) and (5), we may calculate $Q(s)$ numerically for any $s$. The
resulting probabilities closely match the empirical data (Fig. 4), suggesting that
streaks arise only from random statistical fluctuations and not from teams or
individuals getting hot or cold.
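
The recursive calculation of the streak probabilities can be sketched in a few lines. The version below is an illustration under stated assumptions: the play weights are the Table 1 percentages for 1- to 4-point plays renormalized to sum to one, and the short-streak cases are folded into the recursion by adding the single-play term (1 - q) w_s whenever a single play can produce s points. A brute-force enumeration over play sequences is included only as a cross-check.

```python
from itertools import product

Q_SAME = 0.348
# Table 1 percentages for 1- to 4-point plays; renormalizing is an assumption.
_RAW = {1: 8.70, 2: 73.86, 3: 17.28, 4: 0.14}
_TOTAL = sum(_RAW.values())
W = {value: pct / _TOTAL for value, pct in _RAW.items()}


def streak_probs(s_max, q=Q_SAME, w=W):
    """Return a list Q with Q[s] the probability of an s-point streak."""
    Q = [0.0] * (s_max + 1)
    for s in range(1, s_max + 1):
        single = (1.0 - q) * w.get(s, 0.0)          # streak made of one play
        longer = q * sum(w[a] * Q[s - a] for a in w if s - a >= 1)
        Q[s] = single + longer
    return Q


def streak_prob_enumerated(s, q=Q_SAME, w=W):
    """Sum over all play sequences totalling s points (slow; small s only)."""
    total = 0.0
    for n in range(1, s + 1):
        for seq in product(w, repeat=n):
            if sum(seq) == s:
                weight = 1.0
                for v in seq:
                    weight *= w[v]
                total += q ** (n - 1) * (1.0 - q) * weight
    return total


if __name__ == "__main__":
    print(streak_probs(10)[1:])
```

For streaks longer than 4 points the single-play term vanishes and the loop reduces to the four-term recursion.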

Another intriguing feature of basketball games is that the scoring probability 
at any point in the game is affected by the current score: the probability that the 
winning team scores decreases systematically with its lead size; conversely, the 
probability that the losing team scores increases systematically with its deficit size 
(Fig. 5). This effect is well fit by a linear dependence of the bias on the lead (or
deficit) size. (Such a linear restoring force on a random walk is known in the physics
literature as the Ornstein-Uhlenbeck model (Uhlenbeck and Ornstein (1930)).) For
basketball, the magnitude of the effect is small; assuming a linear dependence, a 
least-squares fit to the data gives a decrease in the scoring rate of 0.0022 per point 
of lead. Naively, this restoring force originates from the winning team 'coasting' or 
the losing team increasing its level of effort. 

We now build a random-walk picture for the time evolution of the difference
in the score $\Delta(t)$ between two teams. Each game starts scoreless and
$\Delta(t)$ subsequently increases or decreases after each scoring play until the
game ends. The trajectory of $\Delta(t)$ versus $t$ qualitatively resembles the
position of a random walk as a function of time. Just as for random walks, the
statistically significant quantity is $\sigma^2 \equiv \mathrm{var}(\Delta(t))$, the
variance in the score difference, averaged over many games. For a classic random
walk, $\sigma^2 = 2Dt$, where $D$ is the diffusion coefficient. As illustrated in
Fig. 6, $\sigma^2$ does indeed grow nearly linearly with time for NBA basketball
games, except for the last 2.5 minutes of the game; we will discuss this latter
anomaly in more detail below. A least-squares linear fit to all but the last 2.5
minutes of game data gives $\sigma^2 = 2 D_{\rm fit} t$, with
$D_{\rm fit} = 0.0363$ points$^2$/sec.

Figure 5: Data for the probability $S(L)$ that a team will score next given a lead
$L$ (o). The line is the least-squares linear fit, $S(L) = \frac{1}{2} - 0.0022L$.

We may also independently derive an effective diffusion constant for the
time evolution of the score difference from basic parameters of an anti-persistent
random walk. For such a walk, two successive scores by the same team correspond
to two random-walk steps in the same direction. As mentioned above, we found
that the probability of this outcome is $q = 0.348$. Conversely, the probability for
a score by one team immediately followed by a score by the opposing team is
$1 - q$. Let us define $P(\Delta, t)$ as the probability that the score difference
equals $\Delta$ at time $t$. Using the approach of García-Pelayo (2007) for an
anti-persistent random walk, $P(\Delta, t)$ obeys the recursion

$$P(\Delta, t+\tau) = q P(\Delta-\ell, t) + q P(\Delta+\ell, t) + [(1-q)^2 - q^2]\, P(\Delta, t-\tau), \tag{6a}$$

where $\ell$ is the point value of a single score. To understand this equation, we
rewrite it as

$$P(\Delta, t+\tau) = q\,[P(\Delta-\ell, t) + P(\Delta+\ell, t) - P(\Delta, t-\tau)] + (1-q)\, P(\Delta, t-\tau). \tag{6b}$$

The second term in (6b) corresponds to two scores by alternating teams; thus the
score difference equals $\Delta$ at time $t - \tau$ and again at time $t + \tau$.
This event occurs with probability $1 - q$. The terms in the square bracket
correspond to two successive scores by one team. Consequently a score difference of
$\Delta \pm 2\ell$ at time $t - \tau$ evolves to a score difference $\Delta$ at time
$t + \tau$. Thus the corresponding walk must be at $\Delta \pm \ell$ at time $t$ but
not at $\Delta$ at time $t - \tau$.

Figure 6: Variance in the score difference, $\sigma^2$, as a function of time. The
line $\sigma^2 = 2 D_{\rm fit} t$ is the least-squares linear fit, excluding the last
2.5 minutes of data. The variance reaches its maximum 2.5 minutes before the end of
the game (dashed line).

Expanding $P(\Delta, t)$ in Eq. (6a) to first order in $t$ and second order in
$\Delta$ yields

$$\frac{\partial P}{\partial t} = \frac{q}{1-q}\, \frac{\ell^2}{2\tau}\, \frac{\partial^2 P}{\partial \Delta^2} \equiv D_{\rm ap}\, \frac{\partial^2 P}{\partial \Delta^2}, \tag{7}$$

where $D_{\rm ap}$ is the effective diffusion coefficient associated with an
anti-persistent random walk. Notice that for $q = \frac{1}{2}$ the score evolution
reduces to a simple symmetric random walk, for which the diffusion coefficient is
$D_{\rm ap} = \ell^2/(2\tau)$. Substituting in the values, from the game data,
$q = 0.348$ (probability for the same team to score consecutively), $\ell = 2.0894$
(the mean number of points per scoring event), and $\tau = 30.39$ seconds (the
average time between successive scoring events), we obtain

$$D_{\rm ap} = \frac{q}{1-q}\, \frac{\ell^2}{2\tau} = 0.0383\ \frac{\text{points}^2}{\text{sec}}. \tag{8}$$

This diffusion coefficient is satisfyingly close to the value $D_{\rm fit} = 0.0363$
from the empirical time dependence of $\sigma^2$, and suggests that an
anti-persistent random walk accounts for its time dependence. We attribute the small
discrepancy in the two estimates of the diffusion coefficient to our neglect of the
linear restoring force in the diffusion equation (7).
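
The value of the anti-persistent diffusion coefficient can be checked both by direct substitution and by a small Monte Carlo experiment. The sketch below is illustrative only: it assumes fixed step size and fixed time step, a first step in a random direction, and no restoring force; the function names are hypothetical.

```python
import random


def d_antipersistent(q, step, tau):
    """Effective diffusion coefficient (q / (1 - q)) * step**2 / (2 * tau)."""
    return (q / (1.0 - q)) * step ** 2 / (2.0 * tau)


def simulate_variance(q, step, n_steps, n_walks, seed=0):
    """Mean-square displacement of a walk that repeats its direction w.p. q."""
    rng = random.Random(seed)
    total_sq = 0.0
    for _ in range(n_walks):
        direction = 1 if rng.random() < 0.5 else -1
        position = direction * step
        for _ in range(n_steps - 1):
            if rng.random() >= q:          # reverse with probability 1 - q
                direction = -direction
            position += direction * step
        total_sq += position ** 2          # the mean is zero by symmetry
    return total_sq / n_walks


if __name__ == "__main__":
    d_formula = d_antipersistent(0.348, 2.0894, 30.39)
    variance = simulate_variance(0.348, 2.0894, n_steps=100, n_walks=5000)
    d_simulated = variance / (2.0 * 100 * 30.39)
    print(d_formula, d_simulated)
```

The simulated estimate approaches the formula as the number of steps grows, since the finite-length correction to the variance is of relative order one over the number of steps.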

Thus far, we have treated all teams as equivalent. However, the influence of
team strengths on basketball scoring is not decisive: weaker teams can (and do)
win against better teams. The data show that the winning team in any game has a
better season record than the losing opponent with probability 0.6777. Thus within
our random-walk picture, the underlying bias that arises from the disparity in the
strengths of the two competing teams is masked by random-walk fluctuations. For a
biased random walk with bias velocity $v$ and diffusion coefficient $D$, the
competition between the bias and fluctuations is quantified by the Peclet number
$Pe \equiv v^2 t/2D$ (see, e.g., Probstein (1994), Redner (2001)), the ratio of the
average displacement squared $(vt)^2$ to the mean-square displacement $2Dt$ caused by
random-walk fluctuations. For $Pe \ll 1$, bias effects due to disparities in team
strengths are negligible, whereas for $Pe \gg 1$ the bias is important. For
basketball, we estimate a typical bias velocity from the observed average final
score difference, $|\Delta| \approx 10.7$ points, divided by the game duration of
$t = 2880$ seconds to give $v \approx 0.0037$ points/sec. Using
$D \approx 0.0363$ points$^2$/sec, we obtain $Pe \approx 0.55$, which is small, but
not negligible. Consequently, the bias arising from intrinsic differences in team
strengths is typically not large enough to predict the outcome of typical NBA
basketball games.
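
The Peclet-number estimate is simple arithmetic and can be reproduced directly from the quoted values. In this sketch the function name is illustrative; the inputs are the average final score difference, the game duration, and the fitted diffusion coefficient given in the text.

```python
def peclet(bias_velocity, duration, diffusion):
    """Peclet number: (v * t)**2 divided by 2 * D * t."""
    return bias_velocity ** 2 * duration / (2.0 * diffusion)


GAME_SECONDS = 2880.0
v = 10.7 / GAME_SECONDS                  # bias velocity in points per second
pe = peclet(v, GAME_SECONDS, 0.0363)     # diffusion coefficient in points^2/sec

if __name__ == "__main__":
    print(v, pe)
```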

Finally, the scoring anomaly associated with the last 2.5 minutes of the game
is striking. If the score evolves as an anti-persistent random walk, the distribution
of the score difference should be a Gaussian whose width grows with time as
$\sqrt{Dt}$. As shown in Fig. 7, the distribution of score difference has a Gaussian
appearance, with a width that grows slightly more slowly than $\sqrt{Dt}$. We
attribute this small deviation to the weak restoring force, which gives a diffusion
constant that decreases with time. However, in the final 2.5 minutes of the game,
the score-difference distribution develops a spike at $\Delta = 0$ and dips for
small $|\Delta|$. Thus close games tend to end in ties much more often than expected
from the random-walk picture of the score evolution. This anomaly may stem from the
losing team playing urgently to force a tie, a hypothesis that accords with the
observed increase in scoring rate near the end of the game (Fig. 1).



4 Computational Model 

From all of the empirical observations about scoring, we now construct a
computational random-walk model that broadly accounts for point-scoring statistical
phenomena, as well as the win/loss record of all teams at the end of the season. In
our model, games are viewed as a series of temporally homogeneous and uncorrelated
scoring plays. The time between plays is drawn from a Poisson distribution whose
mean is the observed value of 30.39 seconds. We ignore the short-lived spikes
and dips in the scoring rate at the end of each quarter (Fig. 1) and also the very
rare plays of 5 or 6 points. Thus plays can be worth 1, 2, 3, or 4 points, with
corresponding probabilities drawn from the observed distribution in Table 1.
Simulations of scoring events continue until the final game time of 48 minutes is
reached.

Figure 7: Probability for a given score difference at the end of the first quarter,
after 45.5 minutes, and at the end of the game. The abscissa is rescaled by the
linear fit of the variance, $\sigma^2 = 2 D_{\rm fit} t$ (see Fig. 6). The dashed
curve is the distribution from simulated games with team strength variance
$\sigma_X^2 = 0.0083$ (see Sec. 4).

There are three factors that determine which team scores. First, the better
team has a greater intrinsic chance of scoring. The second factor is the
anti-persistence of successive scoring events that arises from the change of
possession after a score. The last is the linear restoring force, in which the
scoring probability of a team decreases as its lead increases (and vice versa for a
team in deficit). We therefore write the probabilities $P_A$ and $P_B$ that team A
or team B scores next, immediately after a scoring event, as:

$$\begin{aligned}
P_A &= I_A - 0.152\,r - 0.0022\,\Delta, \\
P_B &= I_B + 0.152\,r + 0.0022\,\Delta.
\end{aligned} \tag{9}$$

Here $I_A$ and $I_B$ are the intrinsic scoring probabilities (which must satisfy
$I_A + I_B = 1$), and the term $\pm 0.152\,r$ accounts for the anti-persistence.
Here $r$ is defined as

$$r = \begin{cases} +1 & \text{team A scored previously}, \\ -1 & \text{team B scored previously}, \\ \phantom{+}0 & \text{first play of the game}, \end{cases} \tag{10}$$

and ensures that the average probability for the same team to score twice in
succession equals the observed value of 0.348. Finally, the term $0.0022\,\Delta$
(with $\Delta$ the score difference) accounts for the restoring force with the
empirically measured restoring coefficient (Fig. 5).
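
The rule for who scores next can be written as a small function. This is a sketch rather than the authors' implementation: the function name and argument encoding are hypothetical, and clamping the result to the interval from 0 to 1 is an added safeguard that the text does not mention.

```python
def next_score_probs(intrinsic_a, last_scorer, lead_a):
    """Return (P_A, P_B) for the next scoring play.

    intrinsic_a: intrinsic scoring probability I_A of team A (I_B = 1 - I_A)
    last_scorer: "A", "B", or None before the first score of the game
    lead_a:      current score difference, team A minus team B
    """
    r = {"A": 1, "B": -1, None: 0}[last_scorer]
    p_a = intrinsic_a - 0.152 * r - 0.0022 * lead_a
    p_a = min(1.0, max(0.0, p_a))   # clamp: an assumption, not from the text
    return p_a, 1.0 - p_a


if __name__ == "__main__":
    # Equal teams, tied game, team A has just scored.
    print(next_score_probs(0.5, "A", 0))
```

With equal teams and a tied score, the probability that the team that just scored does so again is 0.5 - 0.152 = 0.348, the observed repeat probability.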

In our minimalist model, the only distinguishing characteristic of team
$\alpha$ is its intrinsic strength $X_\alpha$. We estimate team strengths by fitting
simulated team win/loss records to those predicted by the classic Bradley-Terry
competition model (Bradley and Terry (1952)), in which the intrinsic scoring
probabilities are given by

$$I_A = \frac{X_A}{X_A + X_B}, \qquad I_B = \frac{X_B}{X_A + X_B}. \tag{11}$$

To simulate a season, we first assign a strength parameter to each team that is
fixed for the season. We assume that the distribution of strengths is drawn from
a Gaussian distribution with average $\mu_X$ and variance $\sigma_X^2$ (James,
Albert, and Stern (1993)). Nearly identical results arise for other team strength
distributions. Since the intrinsic probabilities, $I_A$ and $I_B$, depend only on
the strength ratio $X_A/X_B$, we may choose $\mu_X = 1$ without loss of generality,
so the only free parameter is $\sigma_X^2$. We determine $\sigma_X^2$ by simulating
many NBA seasons for a league of 30 teams for a range of $\sigma_X^2$ values and
comparing the simulated probability distributions for various fundamental game
observables with corresponding empirical data.
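
Putting the ingredients of the model together, a single game might be simulated roughly as follows. This is a sketch under several assumptions that are not in the text: waiting times between plays are drawn from an exponential distribution with the observed mean, scoring probabilities are clamped to the interval from 0 to 1, team strengths are floored at a small positive value, and all names are illustrative.

```python
import random

PLAY_VALUES = [1, 2, 3, 4]
PLAY_WEIGHTS = [8.70, 73.86, 17.28, 0.14]  # Table 1; 5- and 6-point plays ignored
MEAN_INTERVAL = 30.39                      # seconds between scoring plays
GAME_SECONDS = 2880.0                      # regulation time


def draw_strengths(n_teams, variance, rng):
    """Gaussian strengths with mean 1; the positive floor is an added safeguard."""
    sigma = variance ** 0.5
    return [max(1e-6, rng.gauss(1.0, sigma)) for _ in range(n_teams)]


def simulate_game(x_a, x_b, rng):
    """Return (score_a, score_b, n_plays) for one simulated regulation game."""
    intrinsic_a = x_a / (x_a + x_b)            # Bradley-Terry ratio
    score_a = score_b = n_plays = 0
    r = 0                                      # 0 until the first score, Eq. (10)
    t = rng.expovariate(1.0 / MEAN_INTERVAL)   # assumed exponential waiting time
    while t < GAME_SECONDS:
        p_a = intrinsic_a - 0.152 * r - 0.0022 * (score_a - score_b)  # Eq. (9)
        p_a = min(1.0, max(0.0, p_a))          # clamp: assumption, not in the text
        points = rng.choices(PLAY_VALUES, weights=PLAY_WEIGHTS)[0]
        if rng.random() < p_a:
            score_a += points
            r = 1
        else:
            score_b += points
            r = -1
        n_plays += 1
        t += rng.expovariate(1.0 / MEAN_INTERVAL)
    return score_a, score_b, n_plays


if __name__ == "__main__":
    rng = random.Random(3)
    strengths = draw_strengths(30, 0.0083, rng)
    print(simulate_game(strengths[0], strengths[1], rng))
```

A season simulation would loop this function over a schedule of pairings and accumulate wins and losses per team; the schedule itself is not specified in the text and is left out here.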

Specifically, we examined: (i) The distribution of a given final score
difference (already shown in Fig. 7). (ii) The season team winning percentage as a
function of its normalized rank (Fig. 8(a)); here, normalized rank is defined so that
the team with the best winning percentage has rank 1, while the team with worst
record has rank 0. (iii) The probability for a team to lead for a given fraction of the
total game time (Fig. 8(b)). (iv) The distribution of the number of lead changes
during a game (Fig. 8(c)).

Our motivation for focusing on these measures is that they provide useful
statistical characterizations of how basketball games evolve. The score difference
is the most basic information about the outcome of a basketball game. Similarly,
the relation between rank and winning percentage provides a clean overall test of
our model. The probability for a given lead time is motivated by the well-known,
but mysterious arcsine law (Feller (1968)). According to this law, the trajectory
of a one-dimensional random walk is likely to always be on one side of the origin
rather than the walk spending equal amounts of time to the left and to the right of
the origin. The ramification of the arcsine law for basketball is that a single team
is likely to lead for most of the game rather than both teams equally sharing the
time in the lead. As a corollary to the arcsine law, there are typically $\sqrt{N}$
crossings of the origin for a one-dimensional random walk of $N$ steps, and the
distribution in the number of lead changes is Gaussian. These origin crossings
correspond to lead changes in basketball games.
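
The arcsine law is easy to observe numerically. The sketch below uses a plain symmetric random walk with unit steps, which is a simplification of the basketball model (no anti-persistence, restoring force, or team bias). A step is counted as spent on the positive side if the walk is positive at either end of the step, which is one common convention and an assumption here.

```python
import random


def lead_fraction_and_changes(n_steps, rng):
    """Fraction of steps spent on the positive side, and number of sign changes."""
    position = 0
    steps_positive = 0
    lead_changes = 0
    last_sign = 0
    for _ in range(n_steps):
        previous = position
        position += 1 if rng.random() < 0.5 else -1
        if position > 0 or previous > 0:
            steps_positive += 1
        sign = (position > 0) - (position < 0)
        if sign != 0:
            if last_sign != 0 and sign != last_sign:
                lead_changes += 1      # the walk has crossed the origin
            last_sign = sign
    return steps_positive / n_steps, lead_changes


if __name__ == "__main__":
    rng = random.Random(2)
    results = [lead_fraction_and_changes(100, rng) for _ in range(4000)]
    lopsided = sum(1 for f, _ in results if f < 0.1 or f > 0.9)
    balanced = sum(1 for f, _ in results if 0.45 <= f <= 0.55)
    print(lopsided, balanced, sum(c for _, c in results) / len(results))
```

Walks in which one side leads for more than 90% of the time turn out to be far more common than walks in which the lead time is split nearly evenly.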
Figure 8: (a) Winning percentage as a function of team rank. The data (circles)
correspond to the 1991-2010 NBA seasons. The solid curve is the simulated win/loss
record when the team strength variance $\sigma_X^2 = 0.0083$. The dashed curve is the
simulated win/loss record if all teams have equal strength, $\sigma_X^2 = 0$.
(b) Probability that a randomly-selected team leads for a given total time.
(c) Probability for the number of lead changes per game: data (o) and simulation
(curve). Simulations were run for 10^ seasons with $\sigma_X^2 = 0.0083$.



For each of the four empirical observables listed above, we compare game
data with the corresponding simulation results for a given value of the team strength
variance $\sigma_X^2$. We quantify the quality of fit between the game data and the
simulation results by the value $\chi^2$ defined by



$$\chi^2 = \sum_x \left[ F_e(x) - F_s(x) \right]^2. \tag{12}$$



Here $F_e(x)$ is one of the four above-mentioned empirical observables, $F_s(x)$ is
the corresponding simulated observable, and $x$ is the underlying variable. For
example, $F_e(x)$ and $F_s(x)$ could be the empirical and simulated probabilities of
the final score difference and $x$ would be the final score difference.



Figure 9: $\chi^2$ as a function of $\sigma_X^2$ for: the score difference
distribution at 45.5 minutes (o), number of lead changes per game (▽), distribution
of time that a team is leading (▷), and winning percentage as a function of rank (△).
Each point is based on simulation of 10-^ seasons.



Figure |9] shows the values of X^ as a function of crj for the four observables. 
The best fit between the data and the simulations all occur when C7| is in the range 
[0.00665, 0.00895] . To extract a single optimum value for o^, we combine the four 
measurements into a single function. Two simple and natural choices are the 
additive and multiplicative forms 




6 



0.015 




mm{xf) ' 





ixf) ' 



(13) 



where the sum and product are over the four observables, $\chi_i^2$ is associated with
the $i$-th observable, and $\min(\chi_i^2)$ is its minimum over all $\sigma_X^2$ values.
The denominator allows one to compare the quality of fit for disparate functions. In the
absence of any prior knowledge about which statistical measure of basketball scoring is
most important, we have chosen to weight them equally. With this choice, both
$f_{\rm add}$ and $f_{\rm mult}$ have minima at $\sigma_X^2 = 0.0083$. Moreover, for this
value of $\sigma_X^2$, the value of $\chi_i^2$ for each observable exceeds its minimum
value by no more than 1.095. These results suggest that the best fit between our model
and empirical data arises when we choose $\sigma_X^2 = 0.0083$. Thus roughly 2/3 of the
NBA teams have their intrinsic strength in the range
$1 \pm \sqrt{\sigma_X^2} \approx 1 \pm 0.09$.
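
The combination step can be sketched as follows, assuming each observable supplies a list of $\chi^2$ values on a shared grid of variance values and that each list is normalized by its own minimum before being summed or multiplied. The grid, the toy $\chi^2$ values, and the names `combined_fits` and `chi2_by_observable` are hypothetical; only the final square root uses a number quoted in the text.

```python
import math


def combined_fits(chi2_by_observable):
    """Return (f_add, f_mult) evaluated on a shared grid of variance values.

    chi2_by_observable maps an observable name to its list of chi^2
    values, one per grid point. Each list is divided by its own minimum,
    so every observable contributes a ratio that is at least 1.
    """
    n = len(next(iter(chi2_by_observable.values())))
    f_add = [0.0] * n
    f_mult = [1.0] * n
    for values in chi2_by_observable.values():
        smallest = min(values)
        for k, value in enumerate(values):
            f_add[k] += value / smallest
            f_mult[k] *= value / smallest
    return f_add, f_mult


# Hypothetical variance grid and chi^2 values for two observables.
grid = [0.006, 0.008, 0.010]
chi2_by_observable = {
    "score_difference": [2.0, 1.0, 3.0],
    "lead_changes": [4.0, 2.5, 2.0],
}

f_add, f_mult = combined_fits(chi2_by_observable)
best_add = grid[f_add.index(min(f_add))]
best_mult = grid[f_mult.index(min(f_mult))]

# Standard deviation implied by the variance 0.0083 quoted in the text.
sigma = math.sqrt(0.0083)
print(best_add, best_mult, round(sigma, 2))
```

Normalizing by the minimum puts observables with very different absolute $\chi^2$ scales on an equal footing, which is what equal weighting requires.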



5 Outlook 

From all the play-by-play data of every NBA basketball game over four seasons, 
we uncovered several basic features of scoring statistics. First, the rate of scoring is 
nearly constant during a basketball game, with small correlations between succes- 
sive scoring events. Consequently, the distribution of time intervals between scoring 
events has an exponential tail (Fig. O. There is also a scoring anti-persistence, in 
which a score by one team is likely to be followed by a score by the opponent
because of the possession change after each basket. Finally, there is a small restoring
force that tends to reduce the score difference between competitors, perhaps be- 
cause a winning team coasts as its lead grows or a losing team plays more urgently 
as it falls behind. 

Based on the empirical data, we argued that basketball scoring data is well 
described by a nearly unbiased continuous-time random walk, with the additional 
features of anti-persistence and a small restoring force. Even though there are
differences in the intrinsic strengths of teams, these play a small role in the
random-walk picture of scoring. Specifically, the dimensionless measure of the effect of
disparities in team strength relative to stochasticity, the Péclet number, is small. The
smallness of the Péclet number means that it is difficult to determine the superior
team by observing a typical game, and essentially impossible by observing a short
game segment. We simulated our random-walk model of scoring and found that it
satisfyingly reproduces many statistical features of basketball scoring in NBA
games. 

This study raises several open issues. First, is the exponential distribution
of time intervals between scoring events a ubiquitous feature of sports competitions?
We speculate that perhaps other free-flowing games, such as lacrosse
(Everson and Goldsmith-Pinkham (2008)), soccer (Dyte and Clarke (2000)), or hockey
(Thomas (2007), Buttrey, Washburn, and Price (2011)), will have the same scoring
pattern as basketball when the time intervals between scores are rescaled by the
average scoring rate for each sport. It also seems plausible that other tactical metrics,
such as the time intervals between successive crossings of mid-field by the game
ball (or puck), may also be described by Poisson statistics. If borne out, perhaps
there is a universal rule that governs the scoring time distribution in sports.
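
One way such a rescaling test could look is sketched below on purely synthetic data. The two scoring rates and the names `rescaled_intervals` and `survival` are hypothetical. The intervals are drawn from exponential distributions by construction, so the sketch only illustrates the check (after dividing by the mean interval, the fraction of intervals longer than one mean should be close to $e^{-1}$) and says nothing about any real sport.

```python
import math
import random


def rescaled_intervals(intervals):
    """Divide each interval by the sample mean, so the rescaled mean is 1."""
    mean = sum(intervals) / len(intervals)
    return [t / mean for t in intervals]


def survival(intervals, s):
    """Empirical fraction of intervals that are longer than s."""
    return sum(1 for t in intervals if t > s) / len(intervals)


rng = random.Random(0)

# Hypothetical mean scoring rates (events per minute); synthetic, not real data.
rates = {"high_scoring_sport": 2.0, "low_scoring_sport": 0.05}

tail_at_one = {}
for name, rate in rates.items():
    raw = [rng.expovariate(rate) for _ in range(20000)]
    tail_at_one[name] = survival(rescaled_intervals(raw), 1.0)

# For exponentially distributed intervals both values should be near exp(-1).
print(tail_at_one, math.exp(-1))
```

With real play-by-play data, `raw` would be replaced by the measured intervals between successive scores, and a systematic departure from $e^{-s}$ across several values of $s$ would argue against a common exponential form.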

Seen through the lens of coaches, fans, and commentators, basketball is a 
complex sport that requires considerable analysis to understand and respond to its 
many nuances. A considerable industry has thus built up to quantify every aspect of 
basketball and thereby attempt to improve a team's competitive standing. However, 
this competitive rat race largely eliminates systematic advantages between teams, 
so that all that remains, from a competitive standpoint, are small surges and ebbs 
in performance that arise from the underlying stochasticity of the game. Thus seen 
through the lens of the theoretical physicist, basketball is merely a random walk 
(albeit in continuous time and with some additional subtleties) and many of the 
observable consequences of the game follow from this random-walk description. 

We thank Guoan Hu for assistance with downloading and processing the 
data and Ravi Heugel for initial collaborations on this project. We also thank Aaron 
Clauset for helpful comments on an earlier version of the manuscript. This work 
was supported in part by NSF grant DMR0906504. 